Abstract We propose a fast, accurate matching method for estimating
dense pixel correspondences across scenes. Estimating dense pixel
correspondences between images depicting different scenes or
instances of the same object category is a challenging problem. While
most such matching methods rely on hand-crafted features such as SIFT,
we learn features from a large number of unlabeled image patches
using unsupervised learning. Pixel-layer features are obtained by
encoding over the dictionary, followed by spatial pooling to obtain
patch-layer features. The learned features are then seamlessly
embedded into a multi-layer matching framework. We experimentally
demonstrate that the learned features, together with our matching
model, outperform state-of-the-art methods such as SIFT flow [1],
coherency sensitive hashing [2] and the recent deformable spatial
pyramid matching [3], in terms of both accuracy and computational
efficiency.

Furthermore, we evaluate the performance of a few different
dictionary learning and feature encoding methods in the proposed
pixel correspondence estimation framework, and analyze the impact of
dictionary learning and feature encoding on the final matching
performance.

1 Introduction

Estimating the dense correspondence between two images across scenes
is an important task, which has many applications in computer vision
and computational photography. Yet, it is a challenging problem due to
large variations exhibited in the matching images. Conventional dense
matching methods developed for optical flow and stereo usually only
work well for the cases in which the two input images contain
different views of the same object. Here we are interested in dense
matching of images with different objects or scenes. This requires
the matching algorithms to be highly robust to different object
appearances and backgrounds, illumination changes, large
displacements and viewpoint changes. For the task of matching objects
in a specific category, the intra-class variability can be larger
than the inter-class differences.

Recently a few methods were proposed to address these challenges,
including hierarchical matching [1], fast patch matching [2, 4],
sparse-to-dense matching [5] and most recently spatial pyramid
matching [3]. Current matching approaches typically rely on either raw
image patches or hand-designed image features (e.g., SIFT
features [6]). Raw pixels or patches often lack the robustness to cope
with those challenging appearance variations. Given a particular
task, in order to model complex real-world data, robust and
distinctive feature descriptors that can capture relevant information
are needed. Hand-crafted features like SIFT have achieved great
success in many vision tasks such as image classification [7],
retrieval, and image matching. Introduced more than a decade ago [6],
SIFT has stood the test of time and is considered one of the milestone
results in computer vision. Despite its remarkable success in a number
of applications, SIFT is criticized for drawbacks such as its large
computational burden and its inability to accommodate affine
viewpoint transformations well. Researchers have been seeking
improved feature descriptors. However, manually designing features
for each data set and task can be very expensive and time-consuming,
and typically requires domain knowledge of the data. In recent years,
researchers observed that, instead of manually designing features
using heuristics, learning features from a large amount of unlabeled
data with unsupervised machine learning approaches achieves
tremendous success in various applications. For example, in visual
recognition the unsupervised feature learning pipeline has now become
the common approach [8, 9]. Feature learning is attractive as it
exploits the availability of data and avoids the need for feature
engineering [7]. The main advantage of unsupervised feature learning
is that unlabeled domain-specific data are usually abundant and very
cheap to obtain. Inspired by this success, we propose unsupervised
feature learning for dense pixel correspondence estimation within a
multi-layer matching framework. The outline of our multi-layer model
is illustrated in Figure 1.


In our framework, features at the bottom layer (namely, the pixel
layer) are extracted from raw image patches using unsupervised
feature learning methods. We then obtain more compact representations
of larger-size nodes at higher-level layers, which achieve better
robustness to noise and clutter and thus deal better with severe
variations in object or scene appearances. Larger spatial nodes with
more compact features provide better geometric regularization when
the matching objects undergo large appearance variations, while
smaller spatial nodes with more detailed features obtain finer
correspondence. Our matching starts from the top layer (i.e., the
grid-cell layer). The matching solution of a higher layer provides
reliable initial correspondences to the lower layer.

We apply several well-known unsupervised feature learning algorithms
to extract pixel-layer features. Then we present a detailed analysis
of the impact of various parameters and configurations of our
framework, covering both the matching model and the unsupervised
feature learning techniques. Despite the simplicity of our system, our
framework achieves higher matching accuracy than all previously
published results on the Caltech-101 dataset, the LMO dataset [11],
and a subset of the Pascal dataset [12]. Our results demonstrate that
it is possible to achieve state-of-the-art performance by using a
tailored matching framework, even with simple unsupervised feature
learning techniques.


Our main contributions are thus as follows. 


- We apply unsupervised feature learning to the problem of dense
  pixel correspondence estimation, rather than using hand-designed
  features. Experimental results show that our method outperforms
  recent state-of-the-art methods [1, 3] in terms of both accuracy and
  running time. Our experiments demonstrate that the learned features
  handle variations of different factors well.

- Inspired by the recent development in multi-layer networks and deep
  learning methods, we perform matching at several levels of the image
  representation (grid-cell layer, patch layer, pixel layer). Our
  multi-layer matching model, designed for fast and accurate matching,
  is suitable for the multi-layer unsupervised feature learning
  pipeline.

- We use the patch-layer feature as the basic unit to estimate
  correspondences in the patch-layer matching, so that computation is
  considerably faster (less time is spent on feature extraction and
  there are fewer variables to optimize) while the desirable power of
  the learned features is retained. Matching results at the patch
  layer already outperform state-of-the-art methods in the literature
  in terms of both matching accuracy and efficiency.

- We evaluate the performance of a few dictionary learning and
  feature encoding methods in the proposed pixel correspondence
  estimation framework. Moreover, we study the effect of parameter
  choices on the features learned by several feature learning methods.
  Several important conclusions are drawn, which differ from the case
  of unsupervised feature learning for image classification [8].


1.1 Related work 


We briefly review some relevant work in dense matching and
unsupervised feature learning. Estimation of dense correspondences
between images is essential for many computer vision tasks such as
image registration, segmentation [13], stereo matching and object
recognition. It is challenging to estimate dense correspondences
between images that contain different scenes. Graph matching
algorithms [14-16] were introduced to find the dense correspondence.
Typically these methods use sparse features and rely on geometric
relationships between nodes. Optical flow methods have been used to
estimate the motion field and dense correspondence in the literature.
More recently, SIFT Flow [1] adopted the computational framework of
optical flow and produces dense pixel-level correspondences by
matching SIFT descriptors instead of raw pixel intensities. A
coarse-to-fine matching scheme is used in their method to speed up the
matching procedure. Kim et al. [3] proposed a deformable spatial
pyramid (DSP) model for fast dense matching. Their model regularizes
matching consistency through a pyramid graph. The matching cost in DSP
is defined by using multiple SIFT descriptors. PatchMatch [4] and the
more recent coherency sensitive hashing (CSH) [2] are much faster in
finding the matching patches between two images, but abandon explicit
geometric smoothness regularization for the sake of speed, which may
lead to noisy matching results because the geometric relations
between pixels are neglected. Leordeanu et al. [5] proposed to extend
sparse matching to dense matching by introducing local constraints.

Fig. 1 Illustration of our multi-layer matching and unsupervised
feature learning model. First column shows the feature extraction
process of each layer. Second column shows the node structure of each
layer. Third column outlines the matching pipeline. The learned
features at the pixel-level layer within a patch are spatially pooled
to form a patch-level feature. Here the grid-cell feature is the
concatenation of patch-level features within a cell. The matching
result from the grid-cell layer guides the matching at the patch-level
layer; and the result at the patch-level layer guides the matching at
the pixel-level layer. In our experiments, the matching accuracy
obtained by the patch-level layer is already very high. Pixel-layer
matching can further improve the matching accuracy.

Image matching in general consists of two components: local image
feature extraction and feature matching. First, one must define the
image features based upon which the correspondence between a pair of
images can be established. An ideal image descriptor should be robust
so that it does not change from one image to another. Many methods use
SIFT features as local descriptors because of their robustness to
scale and illumination changes, etc. Recent work showed that, to some
extent, carefully designed descriptors may improve the matching
results [17, 18]. It has also been shown that SIFT features extracted
at multiple scales lead to much better matches than single-scale
features. All these features were manually designed. Instead, our work
here is inspired by those feature learning approaches that first
appeared in image classification.

In recent years, a large body of work on generic image
classification/categorization has focused on learning features in an
unsupervised fashion [8, 9]. Unsupervised feature learning (or deep
learning by stacking unsupervised feature learning) has emerged as a
promising technique for designing task-specific features by
exploiting a large amount of unlabeled data [19]. The main purpose of
unsupervised feature learning is to design low-dimensional features
that capture some structure underlying the high-dimensional input
data. Typical unsupervised feature learning methods include
independent component analysis [20], auto-encoders, sparse
coding [22, 23], (nonnegative) matrix factorization [24], and a few
clustering methods [25]. In terms of large-scale sparse coding and
matrix factorization based feature learning, an online optimization
algorithm based on stochastic approximations was introduced in [23].


Low-level image alignment such as dense stereo matching, which shares
similarity with the matching task considered here, often uses
hand-crafted local image descriptors [26]. Traditional local feature
descriptors like SIFT have shown their value for dense wide-baseline
matching, but with limited success. This is mainly because of their
high computational cost and sensitivity to occlusions. The SURF
feature [27] tries to speed up the computation of local features.
In [28], Tola et al. designed the DAISY feature for fast and accurate
wide-baseline stereo matching. The DAISY feature attempts to solve
both the computation and occlusion problems in stereo matching.
Another computationally cheap local feature descriptor is a modified
version of the local binary pattern (LBP) feature [29]. In a sparse
matching experiment, Heikkila et al. showed that the LBP descriptor
performs favorably compared to SIFT [29]. Estimating the dense
correspondence between images depicting different scenes, which is
our concern here, is a much more challenging problem than dense stereo
matching. To our knowledge, for dense correspondence estimation across
scenes, the SIFT feature is to date still the standard due to its very
good performance.


Our method is closely related to the approach of [3], which applies
only two layers of matching, namely, the grid-cell layer and the pixel
layer. There, both layers are represented by the same type of
features. They utilize sparse sampling to reduce the complexity and
expense of large-node representations, which may cause loss of
discriminative information. Instead of using SIFT features as the
descriptor as in [3], we learn features from a large number of small
patches, which are randomly extracted from natural images. In our
method, dense matching is performed at several levels of the image
representation (grid-cell layer, patch layer, pixel layer). For each
layer, we obtain suitable features to represent image nodes. Compared
to the bottom layer, features in higher layers are extracted to
achieve more robustness to noise and clutter and a more compact
representation. By using the max-pooling operation, we obtain more
compact representations of larger image nodes while removing
irrelevant details. We demonstrate the efficiency and effectiveness of
the learned features over hand-crafted features for the dense
matching task.


The second approach that has inspired our work is [10]. The main idea
is to combine unsupervised joint alignment with unsupervised feature
learning. Huang et al. used unsupervised feature learning, in
particular deep belief networks (DBNs), to obtain features that can
represent an image at different resolutions based on network
depth [10]. There are major differences between our work and [10],
although both use unsupervised feature learning. Huang et al.
considered the problem of congealing alignment of images, which is to
estimate a parametric image transform. One only needs to optimize a
small number of continuous variables (typically the rotation matrix
and translation). Thus the number of variables is independent of the
image size [10]. In Huang et al. [10], gradient descent is employed
for this purpose. In contrast, we estimate nonparametric
correspondences at the pixel level. The optimization problem involved
in our task is a much more challenging discrete combinatorial problem,
which involves thousands of discrete variables. Thus we use belief
propagation to obtain a locally optimal solution. It is unclear how
the method of [10] may be applied to dense correspondences.


2 Our approach 

In this section, we first describe how to extract features for each
level in Section 2.1. Then, we present our framework in detail in
Section 2.2.


2.1 Multi-layer image representations 

In this work, we follow the standard unsupervised feature learning
framework in [8, 9], which has been successfully applied to generic
image classification. A common feature learning framework performs
the following steps to obtain feature representations:

1. The dictionary learning step learns a dictionary using an
   unsupervised learning algorithm. The dictionary is used to map
   input patch vectors to the new feature vectors.

2. The feature encoding step extracts features from image patches
   centered at each pixel using the dictionary obtained at the first
   step.

3. The pooling step spatially pools features together over local
   regions of the images to obtain more compact feature
   representations.

For classification tasks, the learned features are then used to train
a classifier for predicting labels. In our case, we estimate dense
correspondences in a multi-layer matching framework using the learned
multi-level feature representations.

Next, we briefly review the pipeline of the feature learning
framework. As mentioned above, there are three key components in the
feature learning framework: dictionary learning, feature encoding and
the pooling operation.

2.1.1 Dictionary learning

To learn a dictionary, we first extract N random patches from a
collection of natural images as training data, $x_i \in R^n$
$(i = 1, 2, \ldots, N)$, and then pre-process these patches as
described in [19]. Every patch is normalized by subtracting the mean
and dividing by the standard deviation of its elements. This is
equivalent to local brightness and contrast normalization. We also
apply the whitening process as done in [9]. Then we use an
unsupervised learning algorithm to construct the dictionary
$D = [d_1, d_2, \ldots, d_M] \in R^{n \times M}$. Here M is the
dictionary size, and each column $d_j$ is a codeword.

We consider the following dictionary learning methods for learning
the dictionary D:

(a) K-means clustering: We learn M centroids $\{d_j\}$,
$j = 1, 2, \ldots, M$, from the sampled patches. K-means has been
widely adopted in computer vision for building codebooks but less
widely used in ‘deep feature learning’ [9]. This may be due to the
fact that K-means is less effective when the input vectors are of high
dimension.

(b) K-SVD [30]: The dictionary is trained by solving the following
optimization problem using alternating minimization:

$$\min_{D, S} \sum_{i=1}^{N} \|x_i - D s_i\|_2^2, \quad \text{s.t.} \; \|s_i\|_0 \le k, \; \forall i, \tag{1}$$

where $S = [s_1, s_2, \ldots, s_N] \in R^{M \times N}$ are the sparse
codes. $\|s_i\|_0$ is the number of non-zero elements in $s_i$, which
enforces the code's sparsity. Here $\|\cdot\|_2$ and $\|\cdot\|_0$ are
the $L_2$ and $L_0$ norm respectively. Note that to solve for S,
usually one seeks an approximate solution because the optimization
problem is NP-hard.

(c) Random sampling (RA): M patches are randomly picked from the N
patches to form a dictionary. Therefore no learning/optimization is
performed in this case.

Later we test these three dictionary learning methods on the problem
of dense matching.

2.1.2 Feature encoding

After obtaining the dictionary D, we extract patches centered at each
pixel of the pair of matching images after applying pre-processing.
The patch vector $x_i$ is encoded to generate the feature vector
$s_i$ at the pixel layer. We consider the following coding methods in
this work:

(a) K-means triangle (KT) encoding: This can be viewed as a ‘soft’
encoding method while keeping sparsity of codes. With the M basis
vectors $\{d_j\}$ learned by the first stage, KT encodes the patch
$x_i$ as:

$$s_{ij} = \max\{0, \mu(z) - z_j\},$$

where $s_{ij}$ is the j-th component of feature vector $s_i$;
$z_j = \|x_i - d_j\|_2$ and $\mu(z)$ is the mean of z. By using this
encoder, roughly half of the features are set to 0.

(b) Soft-assignment (SA) encoding:

$$s_{ij} = \frac{\exp(-\beta \|x_i - d_j\|_2^2)}{\sum_{l=1}^{M} \exp(-\beta \|x_i - d_l\|_2^2)}.$$

Here $\beta$ is the smoothing factor controlling the softness of
assignment.

(c) Orthogonal matching pursuit (OMP-k) encoding: Given the patch
$x_i$ and dictionary D, we use OMP-k [8] to obtain the feature $s_i$,
which has at most k non-zero elements:

$$\min_{s_i} \sum_{i=1}^{N} \|x_i - D s_i\|_2^2, \quad \text{s.t.} \; \|s_i\|_0 \le k, \; \forall i, \tag{3}$$

where $\|s_i\|_0$ is the number of non-zero elements in $s_i$. This
explicitly enforces the code's sparsity.

We mainly use K-means for dictionary learning and K-means triangle
(KT) for encoding in our experiments. In the last part of Section 3.3,
we evaluate the performance of the different learning and encoding
methods mentioned above in dense correspondence estimation.

2.1.3 Pooling operation 


The general objective of pooling is to transform the joint feature
representation into a new, more usable one that preserves important
information while discarding irrelevant details [31]. Spatially
pooling features over a local neighbourhood to create invariance to
small transformations of the input is employed in a large number of
models of visual recognition. The pooling operation is typically a
sum, an average, a max or, more rarely, some other commutative
combination rule. In this paper, we apply the max-pooling operation to
obtain the patch-layer features.

The pixel feature $s_i \in R^M$ is the code of the patch centered at
pixel i, which is obtained at the feature encoding step. At the patch
layer, the image is partitioned using a uniform grid into
non-overlapping square patches. Each patch feature
$f = [f_1, \ldots, f_j, \ldots, f_M] \in R^M$ is obtained by
max-pooling all pixel features within that patch, which is simply the
component-wise maximum over all pixel features within a patch P:

$$f_j = \max_{i \in P} s_{ij}, \tag{4}$$

where i ranges over all pixels in image patch P. Thus each patch
feature has the same dimension as the pixel features. Note that the
max-pooling operation is non-linear. It captures the main character of
the pixels in the patch while maintaining the feature length and
reducing the number of features. Detailed discussions on the impact of
feature learning methods are presented in Section 3.3.
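
The max-pooling step of Eq. (4) amounts to a reshape followed by a maximum. The sketch below is illustrative only: the function name, the (H, W, M) array layout and the choice to drop incomplete border patches are assumptions, not details given in the text.

```python
import numpy as np

def max_pool_patches(S, patch):
    """Max-pool pixel codes S of shape (H, W, M) over non-overlapping
    patch x patch squares; returns an array of shape
    (H // patch, W // patch, M)."""
    H, W, M = S.shape
    h, w = H // patch, W // patch
    S = S[:h * patch, :w * patch]              # drop incomplete border patches
    blocks = S.reshape(h, patch, w, patch, M)  # split both spatial axes
    return blocks.max(axis=(1, 3))             # component-wise max per patch
```

Each output vector keeps the length M of the pixel codes, while the number of nodes shrinks by a factor of `patch * patch`.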

2.1.4 Grid-cell layer representations 

The grid-cell layer is built on the patch-level layer. The structure
of the grid-cell layer is a spatial pyramid, as shown in Figure 2.
Thus it can contain multiple levels. Each level contains a number of
cells at different resolutions. The cell size starts from the whole
image (the top level in Figure 2) and decreases to a certain spatial
size according to the number of pyramid levels. A cell node is much
larger than an image patch and may offer greater regularization when
appearance matches are ambiguous. Each cell is represented by a
grid-cell feature, called a cell node. Grid-cell features are formed
by concatenating the patch-layer features within a cell. For all the
experiments in Section 3, we use the 3-level pyramid shown in
Figure 2. At the top level of the pyramid, there is one single cell
node which is the whole image. The middle level contains 4 equal-size
non-overlapping cell nodes, and the bottom level has 16 cell nodes.
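
As a rough illustration of the pyramid described above, the following sketch builds cell-node features by concatenating patch features. It assumes that level l splits the patch grid into 2^l x 2^l equal cells (which reproduces the 1, 4 and 16 cell nodes of the 3-level pyramid); the function name and array layout are hypothetical.

```python
import numpy as np

def grid_cell_features(F, levels=3):
    """F: patch features of shape (h, w, M). Returns one list of cell
    features per pyramid level; each cell feature is the concatenation
    of all patch features that fall inside the cell."""
    h, w, _ = F.shape
    pyramid = []
    for level in range(levels):
        k = 2 ** level                          # k x k cells at this level
        rows = np.linspace(0, h, k + 1).astype(int)
        cols = np.linspace(0, w, k + 1).astype(int)
        cells = [F[rows[a]:rows[a + 1], cols[b]:cols[b + 1]].reshape(-1)
                 for a in range(k) for b in range(k)]
        pyramid.append(cells)
    return pyramid
```

With an 8 x 8 patch grid, the top-level cell concatenates all 64 patch features, while each bottom-level cell concatenates a 2 x 2 block of them.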

To demonstrate the great potential of unsupervised feature learning
techniques in the dense matching task, which generally requires
features to preserve visual details, we deliberately avoid complex
learning algorithms and models. Instead, we employ simple learning
algorithms and design a tailored matching model for the learned
features to estimate pixel-level correspondences. The experiments in
Section 3 demonstrate that it is possible to achieve state-of-the-art
performance even with simple unsupervised feature learning
algorithms.

2.2 The proposed matching framework 
2.2.1 The multi-layer model 

Our matching model consists of three layers: the grid-cell layer, the
patch-level layer and the pixel-level layer.

(a) model structure: The grid-cell layer is the top layer, which is a
conventional spatial pyramid (we use a 3-level pyramid for all the
experiments in this work). The cell size starts from the whole image
and decreases to the pre-defined cell size. Grid-cell node features
are formed by concatenating the patch-level features within a cell.
The patch-level layer lies underneath the grid-cell layer. The bottom
layer is the pixel-level layer.

(b) node definition: At the grid-cell layer, each cell is a node. At
the patch-level layer, each patch is a node. At the pixel-level layer,
a single pixel is a node.

(c) node linkage: In the pyramid of the grid-cell layer, each node
links to the neighboring nodes within the same pyramid level as well
as to its parent and child nodes in the adjacent pyramid levels, as
shown in Figure 2. We define the node at the higher-level layer as the
parent of the nodes within its spatial extent at the lower layer. For
the bottom two layers, namely the patch-level layer and the
pixel-level layer, each node is only linked to its parent node.

Figure 1 shows our matching pipeline. Our matching process starts from
the grid-cell layer. At this layer, the matching cost and geometric
regularization are considered for pyramid nodes of different spatial
extents. Matching results of the grid-cell layer guide the patch-level
matching. In other words, results of the grid-cell layer offer
reliable initial correspondences for the patch-level matching, as
larger spatial supports generally provide better robustness to image
variations. At the patch-level layer, we estimate the correspondences
between the patch nodes of the image pair. Guided by the grid-cell
matching results, the patch-level matching in our framework can
already achieve high matching accuracy and efficiency. In [1, 3], the
authors sub-sample pixels to reduce the computation cost, which may
lead to suboptimal solutions. In contrast, we do not need to
sub-sample pixels because the patch-layer matching in our framework is
extremely fast, which is one of the major advantages of our method. At
the pixel-level layer, the pixel matching refines the results of the
patch-layer matching. Figure 3 compares the matching results from the
pixel layer and the patch layer. We can see that the pixel-layer
matching provides finer visual results at the cost of heavier
computation.


2.2.2 Matching objective 

Fig. 2 Illustration of the pyramid structure and node connection of
the grid-cell layer. Note that the grid-cell layer can contain
multiple pyramid levels (here we have 3 levels).

Fig. 3 Patch-level matching results vs. pixel-level matching results
of our method. (a) and (b) are the test image and exemplar image,
respectively. Images (c) and (d) are the patch-level result and the
pixel-level matching result of our method, respectively. We can see
that the pixel-level matching provides refined visual results.

For the grid-cell layer, $q_i$ denotes the center coordinate of
patches within the cell node i. Let $t_i = (u_i, v_i)$ be the
translation of node i from the test image to the exemplar image. By
minimizing the energy function E, we obtain the optimal translation of
each node in the grid-cell layer. The objective function is defined
as:

$$E(t) = \sum_i D_i(t_i) + \alpha \sum_{(i,j) \in \mathcal{N}} V_{ij}(t_i, t_j), \tag{5}$$

where $D_i$ is the data term; $V_{ij}$ is the smoothness term; and
$\alpha$ is the constant weight. $\mathcal{N}$ represents node pairs
linked by graph edges between the neighboring nodes and the
parent-child nodes in the spatial pyramid. In the above equation, the
data term $D_i(t_i)$ is defined as:

$$D_i(t_i) = \frac{1}{z} \min(\|f_t(q_i) - f_e(q_i + t_i)\|_1, \lambda), \tag{6}$$

where z is the total number of patches included in the grid-cell
node i. $f_t(q_i)$ is the cell node feature of the test image centered
at coordinate $q_i$, and $f_e(q_i + t_i)$ is the cell node feature of
the exemplar image under a certain translation $t_i$. $D_i(t_i)$
measures the similarity between node i and the corresponding node in
the exemplar image according to translation $t_i$. Here $\lambda$ is a
truncation threshold on the feature distance. We set it to the mean
distance of pairwise pixels between image pairs. $\|\cdot\|_1$ is the
$L_1$ norm.

Second, the smoothness term is defined as:

$$V_{ij}(t_i, t_j) = \min(\|t_i - t_j\|_1, \gamma), \tag{7}$$

which penalizes the matching location discrepancies among the
neighboring nodes. We use loopy belief propagation (BP) to find the
optimal correspondence of each node. Although BP is not guaranteed to
converge, it has been applied with much experimental success [32]. In
our objective function, truncated $L_1$ norms are used for both the
data term and the smoothness term. The smoothness term accounts for
matching outliers. As in [3, 33], for BP, we use a generalized
distance transform technique such that the computation cost of message
passing between nodes is reduced.
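
To illustrate why a distance transform reduces the cost of message passing, the sketch below computes a min-sum message under the truncated L1 smoothness term with two linear passes per axis instead of comparing every pair of translations. It is a generic sketch of the standard two-pass technique, not the authors' implementation: the function names are hypothetical, and translations are assumed to lie on a regular integer grid so that the cost table `h` is a 2D array.

```python
import numpy as np

def l1_distance_transform_1d(h, c):
    """Return m[t] = min_s h[s] + c * |t - s| using two linear passes."""
    m = np.array(h, dtype=float)
    for t in range(1, len(m)):             # forward pass
        m[t] = min(m[t], m[t - 1] + c)
    for t in range(len(m) - 2, -1, -1):    # backward pass
        m[t] = min(m[t], m[t + 1] + c)
    return m

def truncated_l1_message(h, alpha, gamma):
    """h: 2D cost table over integer translations (u, v).
    Returns m[t] = min_s h[s] + alpha * min(||t - s||_1, gamma)."""
    # The L1 norm is separable, so transform one axis after the other.
    m = np.apply_along_axis(l1_distance_transform_1d, 0, h, alpha)
    m = np.apply_along_axis(l1_distance_transform_1d, 1, m, alpha)
    # The truncation caps the penalty at alpha * gamma above the best cost.
    return np.minimum(m, h.min() + alpha * gamma)
```

For a table with T candidate translations, this takes O(T) operations per message, compared with O(T^2) for the exhaustive minimization.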

For the patch-level layer, each patch links to the parent node in the
grid-cell layer. $q_i$ denotes the center coordinate of each patch in
the patch-level layer. The patch's optimal translation can be obtained
by:

$$D_i(t) = \min(\|f_t(q_i) - f_e(q_i + t)\|_1, \lambda),$$
$$t_i = \arg\min_t \big(D_i(t) + \alpha V_{ip}(t, t_p)\big). \tag{8}$$

Here $t_i$ and $t_p$ are patch i's optimal translation and its parent
cell node's optimal translation, respectively. $f_t$ and $f_e$ denote
the patch features in the test image and exemplar image, respectively.
Note that the results from the grid-cell layer provide reliable
initial correspondences for the patches at the patch layer.
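
Since each patch is linked only to its parent, the minimization in Eq. (8) can be carried out independently per patch. The following sketch does this by exhaustive search over a candidate set. The function name, the use of patch-grid units for translations, and the way candidates are enumerated are assumptions made for illustration; the text does not specify the search range.

```python
import numpy as np

def best_patch_translation(F_t, F_e, q, t_parent, candidates,
                           lam, alpha, gamma):
    """Pick the translation t minimizing D(t) + alpha * V(t, t_parent)
    for the patch at grid position q (a (row, col) tuple).
    F_t, F_e: patch features of shape (h, w, M) for the test and
    exemplar images."""
    best_t, best_cost = None, np.inf
    for t in candidates:
        r, c = q[0] + t[0], q[1] + t[1]
        if not (0 <= r < F_e.shape[0] and 0 <= c < F_e.shape[1]):
            continue  # translation leaves the exemplar image
        # Truncated L1 data term between the two patch features.
        data = min(np.abs(F_t[q] - F_e[r, c]).sum(), lam)
        # Truncated L1 smoothness term with respect to the parent.
        smooth = min(abs(t[0] - t_parent[0]) + abs(t[1] - t_parent[1]),
                     gamma)
        cost = data + alpha * smooth
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

The same routine applies to the pixel-level layer by passing pixel codes instead of patch features and the parent patch's translation as `t_parent`.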

For the pixel-level layer, each pixel links to the parent patch node.
Guided by the patch-layer solution, pixel-layer correspondences can be
estimated efficiently and accurately. $q_i$ denotes the pixel's
coordinate in the pixel-level layer. The $D_i(t)$ in pixel i's optimal
translation is defined as

$$D_i(t) = \min(\|s_t(q_i) - s_e(q_i + t)\|_1, \lambda),$$
$$t_i = \arg\min_t \big(D_i(t) + \alpha V_{ip}(t, t_p)\big). \tag{9}$$

Here $t_i$ and $t_p$ are pixel i's optimal translation and its parent
patch node's optimal translation, respectively. $s_t$ and $s_e$ denote
the pixel features in the test image and exemplar image, respectively.

2.2.3 Discussion 

Our method improves the matching accuracy and substantially reduces
the computation time compared to the recent state-of-the-art
method [3]. In [3], the authors use sparse descriptor sampling to
reduce the computation time, which may cause loss of key
characteristics in the nodes of the matching graph.

The grid-cell layer of our matching model is built upon the
patch-layer features, which cover the whole image area without using a
sparse sampling process. The pixel-layer features obtained by an
unsupervised learning algorithm appear to be discriminative for
resolving matching ambiguity between classes. By using a pooling
operation on the pixel features, we significantly reduce the number of
patch-layer features and the possible translations, while enhancing
the robustness to severe visual variations. Matching at the
patch-level layer is the core of our algorithm. At the patch layer,
image pixels within the same patch share the same optimal translation.
Our experimental results show that the patch-layer matching results
outperform the state-of-the-art methods with a much faster computation
speed. The pixel-level matching further refines the matching accuracy.
The pixel-level matching procedure can therefore be considered
optional: if test-time speed is a concern, one does not have to
perform it.

As can be seen from Figure 3, the pixel-level matching output
(Figure 3(d)) retains finer details at object edges, compared to the
patch-level matching result (Figure 3(c)). Our pixel and patch feature
encoding schemes allow us to reduce the computation while improving
the matching results. Experimental results in the next section
demonstrate the advantages of our method.


3 Experiments

We conduct experiments to evaluate the matching quality
(Section 3.2) and to analyse the performance impact of several
different elements in the feature learning framework (Section 3.3).

In Section 3.2 we test our method on three benchmark vision
datasets: the Caltech-101 dataset, the LabelMe Outdoor (LMO)
dataset [11], and a subset of the Pascal dataset.

We also apply our method to semantic segmentation. We compare
our method with state-of-the-art dense pixel matching methods,
namely, the deformable spatial pyramid (DSP) approach
(single-scale) [3], SIFT Flow (SF) [1], and coherency sensitive
hashing (CSH) [2]. Note that DSP has achieved the previously best
results on dense pixel matching. For the DSP, SF and CSH methods,
we use the code provided by their authors.

In Section 3.3 we present a detailed analysis of the impact of
parameter settings in feature learning, including: (a) the choice
of the unsupervised feature learning algorithm, (b) the impact of
the dictionary size, (c) the impact of the training data size, and
(d) different configurations in the patch feature extraction
process.

We set the parameters of the compared methods to the values
suggested in the original papers. In all of our experiments, we
use a 3-level pyramid in the grid-cell layer. The parameters of
our method are fixed to $\alpha = 0.02$, $\gamma = 0.5$ for all
experiments.

A universal dictionary is learned from image patches extracted
from 200 Background_Google class images in the Caltech-101
dataset. Note that ‘Background_Google’ contains mainly natural
images which are irrelevant to the test images in our experiments.
The dictionary is learned before the matching process. Once the
dictionary is learned, it is used to encode all the test images.
We use K-means dictionary learning and K-means triangle (KT)
encoding for our method. The dictionary size is set to 100.
Clearly, the length of the feature vectors at both the pixel-level
layer and the patch-level layer equals the dictionary size.

Pixel-layer features are then computed at each pixel of the test
images. More specifically, we encode the image region around each
pixel using the learned dictionary to form the feature vector of
that pixel. Each pixel feature is thus extracted from an
11 x 11-pixel image region centered at that pixel. Patch features
are then calculated by max-pooling pixel features within each
non-overlapping patch of 7 x 7 pixels. The grid-cell layer is
constructed as a 3-level pyramid, as shown in Figure 2.


3.1 Evaluation metrics

Following [3], we use the label transfer accuracy (LT-ACC) [11],
the intersection over union (IOU) metric [12] and the localization
error (LOC-ERR) [3] to measure the quality of dense image matching.
For each image pair (a test image and an exemplar image), we find
the pixel correspondences between them using a matching algorithm
and transfer the annotated class labels of the exemplar image
pixels to the test image pixels.

The LT-ACC criterion measures how many pixels in the test image
have been correctly labelled by the matching algorithm. On the
Caltech-101 dataset, each image is divided into foreground and
background pixels. The IOU metric reflects the matching quality
for separating the foreground pixels from the background. Since
ground-truth pixel correspondence information is not available
for Caltech-101, we use the LOC-ERR metric to evaluate the
distortion of corresponding pixel locations with respect to the
object bounding box.

Mathematically, the LT-ACC metric is computed as

$$r = \frac{1}{\sum_i m_i} \sum_i \sum_{p \in \Lambda_i} \mathbb{1}\left(o(p) = a(p),\ a(p) > 0\right), \qquad (10)$$

where for pixel $p$ in image $i$, the ground-truth annotation is
$a(p)$ and the matching output is $o(p)$; for unlabeled pixels
$a(p) = 0$. $\Lambda_i$ is the image lattice of image $i$, and
$m_i = \sum_{p \in \Lambda_i} \mathbb{1}(a(p) > 0)$ is the number of
labeled pixels in image $i$ [11]. Here $\mathbb{1}(\cdot)$ outputs 1
if the condition is true and 0 otherwise.
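
The LT-ACC definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name and the representation of label maps as integer NumPy arrays (with 0 marking unlabeled pixels, as in the text) are assumptions.

```python
import numpy as np


def label_transfer_accuracy(annotations, outputs):
    """LT-ACC over a collection of images.

    annotations, outputs: sequences of integer label maps of equal shape;
    annotation label 0 marks an unlabeled pixel and is ignored.
    """
    correct = 0
    labeled = 0
    for a, o in zip(annotations, outputs):
        a = np.asarray(a)
        o = np.asarray(o)
        mask = a > 0                       # only labeled pixels count
        correct += int(np.count_nonzero((o == a) & mask))
        labeled += int(np.count_nonzero(mask))
    return correct / labeled if labeled else 0.0
```

Note that the ratio is taken over all labeled pixels of all images jointly, so larger images weigh more than smaller ones.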

To define the LOC-ERR of corresponding pixel positions, we first
designate each image's pixel coordinates using its ground-truth
object bounding box. Pixel coordinates are normalized with respect
to the box's position and size. Then, the localization error of
two matched pixels is defined as
$e = 0.5(|x_1 - x_2| + |y_1 - y_2|)$, where $(x_1, y_1)$ is the
pixel coordinate in the first image and $(x_2, y_2)$ is its
corresponding location in the second image [3].
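
A small Python sketch of the LOC-ERR computation follows. The bounding-box representation `(x_min, y_min, width, height)` and the function name are assumptions introduced for illustration; the text only states that coordinates are normalized by the box position and size.

```python
def localization_error(p1, box1, p2, box2):
    """LOC-ERR of two matched pixels.

    p1, p2: (x, y) pixel coordinates in the first and second image.
    box1, box2: ground-truth boxes as (x_min, y_min, width, height);
    this layout is an assumption made for the sketch.
    """
    def normalize(p, box):
        x_min, y_min, width, height = box
        return (p[0] - x_min) / width, (p[1] - y_min) / height

    x1, y1 = normalize(p1, box1)
    x2, y2 = normalize(p2, box2)
    return 0.5 * (abs(x1 - x2) + abs(y1 - y2))
```

For example, a pixel at the center of its box matched to a pixel at the top-left corner of the other box gives an error of 0.5.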

The intersection over union (IOU) segmentation measure is used to
assess per-class accuracy on the intersection of the predicted
segmentation and the ground truth, normalized by the union.
Formally, it is defined as

$$u = \frac{\text{true pos.}}{\text{true pos.} + \text{false pos.} + \text{false neg.}}. \qquad (11)$$

IOU is now the standard metric for segmentation [12].
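
The IOU measure for a single class can be sketched as follows. The function name and the convention that the foreground class carries label 1 are assumptions for illustration only.

```python
import numpy as np


def intersection_over_union(pred, gt, label=1):
    """IOU of one class between a predicted and a ground-truth label map."""
    pred_mask = np.asarray(pred) == label
    gt_mask = np.asarray(gt) == label
    true_pos = np.count_nonzero(pred_mask & gt_mask)
    false_pos = np.count_nonzero(pred_mask & ~gt_mask)
    false_neg = np.count_nonzero(~pred_mask & gt_mask)
    denom = true_pos + false_pos + false_neg
    return true_pos / denom if denom else 0.0
```

True negatives do not enter the ratio, so a large, easily labeled background cannot inflate the score of the foreground class.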


3.2 Comparison with state-of-the-art dense matching methods

In this section, we compare our method against state-of-the-art
dense matching methods to examine the matching quality on object
matching and scene segmentation. Details of our method are
described in Section 2.1.


3.2.1 Comparison with different features and matching methods

For this experiment, we evaluate the matching performance of
different features in different matching frameworks. 100 test
image pairs are randomly picked from the Caltech-101 dataset, with
each pair chosen from the same class. The results are shown in
Table 1. SIFT features in our multi-layer matching model do not
reach the accuracy of the learned features, which shows the
advantage of learned features over hand-crafted features such as
SIFT.

We also use the learned features to replace the SIFT features in
the DSP framework [3], with the same test images. In the DSP
framework, the SIFT features obtain better matching accuracy than
the learned features. This shows that the DSP method [3] is not
able to take advantage of learned features, while our matching
framework is tailored to the unsupervised feature learning
technique.

The third observation concerns the computation speed of the
compared methods. In terms of CPU time, our method (at the patch
level) is about 8 times faster than DSP [3] and 50 times faster
than the SIFT Flow of Liu et al. [1].

For the patch-level matching, our method outperforms CSH by about
13 points in LT-ACC, yet is twice as fast as CSH. Note that CSH is
a notably fast method, which exploits hashing to quickly find
matching patches between two images. Our pixel-level matching
further improves the patch-layer matching accuracy and provides
better visual matching quality, which is hard to measure by
LT-ACC. Examples are shown in Figure 3.

By using multi-level representations, our proposed matching method
enables the learned features (obtained by unsupervised learning
methods) to outperform hand-crafted features (e.g., SIFT features)
in dense matching tasks. A generic matching framework may not help
features achieve their best performance, whereas a suitable
matching framework improves the feature performance in the
matching task. We conclude that our matching framework and the
unsupervised feature learning pipeline are tightly coupled to
achieve the best performance.


3.2.2 Intra-class image matching

Now we conduct extensive experiments on the Caltech-101 dataset.
We randomly pick 20 image pairs for each
class (2020 image pairs in total). Each pair of images
is chosen from the same class. The ground-truth annotation for
each image is available, indicating the foreground object and the
background.


framework       feature   matching level   LT-ACC   IOU     LOC-ERR   Time (sec)
-------------------------------------------------------------------------------
Ours            learned   patch level      0.801    0.501   0.323     0.07
Ours            learned   pixel level      0.803    0.505   0.324     0.45
Ours            SIFT      patch level      0.757    0.449   0.409     0.09
Ours            SIFT      pixel level      0.759    0.451   0.410     0.76
DSP [3]         learned   -                0.765    0.436   0.359     0.40
DSP [3]         SIFT      -                0.792    0.496   0.357     0.65
SIFT Flow [1]   -         -                0.763    0.479   0.351     4.32
CSH [2]         -         -                0.669    0.365   1.002     0.16

Table 1 Comparison of object matching performance of different methods on 100 pairs of images from the Caltech-101 dataset
in terms of the matching accuracy and speed. The best results are shown in bold.


method               LT-ACC   Time (s)
--------------------------------------
Ours (patch layer)   0.772    0.033
DSP                  0.761    0.545
SIFT Flow            0.741    2.82
CSH                  0.590    0.163

Table 2 Intra-class image matching performance on the
Caltech-101 dataset. The best results are in bold.


Fig. 4 (a) shows the percentage of each method achieving
the best performance in all of the 101 classes. Our method
achieves the best matching accuracy in 55 classes (54% of 101
classes). (b) shows the histogram of each method's achievements
over matching accuracy (LT-ACC) in all classes.

Table 2 shows the matching accuracy and CPU time of our method
against the state-of-the-art methods. As can be seen, our method
achieves the highest label transfer accuracy and is more than 10
times faster than DSP in the matching process. In this experiment,
there are only two labels for each test image and the intra-class
variability is very large, so it is hard to achieve an
improvement of 0.1.


The intra-class variability differs between classes. For example,
in the ‘Face’ class, the objects in the images are similar, and
the highest matching accuracy reaches 0.952 (obtained by the SIFT
Flow method). In the ‘Beaver’ class, however, images vary much
more than in the ‘Face’ class and are hard to match. Therefore, as
expected, the best accuracy among the four compared methods is
lower for the ‘Beaver’ class than for the ‘Face’ class. In the
‘Beaver’ class, our method outperforms the other methods and
obtains a matching accuracy of 0.687.


As the intra-class variability is hard to measure, we use the best
matching accuracy to reflect the variability of each class: a
higher matching accuracy means a smaller intra-class variability.
Figure 4(b) shows the histogram of the best matching accuracy over
all classes. Within each bin, we group the data by matching method.

Figure 4(b) shows that our method outperforms other methods in
most cases. SIFT-based methods achieve better results in the
‘easy’ high-accuracy classes such as ‘Face’, since they handle the
matching of objects with similar appearance well. This conclusion
is consistent with the case of sparse point matching applications.
Our method achieves better matching results for classes with large
intra-class variability, and is therefore more suitable for
handling such matching problems than the compared methods.


The pie chart in Figure 4(a) shows the percentage of classes in
which each method achieves the best accuracy among the 101
classes. Our method achieves the best matching accuracy in 55 out
of 101 classes, outperforming the compared methods by a large
margin.
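
The per-class "winner" statistic summarized by the pie chart can be computed with a short Python sketch. The function name and the nested-dictionary input format are assumptions, and the numbers in the usage example are toy values, not results from the paper.

```python
from collections import Counter


def best_method_share(per_class_accuracy):
    """Fraction of classes in which each method has the highest accuracy.

    per_class_accuracy: {class_name: {method_name: accuracy}}.
    Ties go to the method listed first for that class.
    """
    wins = Counter(
        max(scores, key=scores.get) for scores in per_class_accuracy.values()
    )
    total = len(per_class_accuracy)
    return {method: count / total for method, count in wins.items()}


# Toy, made-up accuracies for four hypothetical classes.
toy = {
    "A": {"Ours": 0.7, "DSP": 0.6},
    "B": {"Ours": 0.5, "DSP": 0.8},
    "C": {"Ours": 0.9, "DSP": 0.4},
    "D": {"Ours": 0.6, "DSP": 0.5},
}
print(best_method_share(toy))
```

With the figures reported in the text, the same computation gives 55 / 101, which is approximately 54% of the classes for the proposed method.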


Figure 5 shows some qualitative results of the compared methods.
The results show that our method is more robust than other methods
under large object appearance variations (e.g., the second
example) and cluttered backgrounds (e.g., the first and third
examples).


The DSP and SIFT Flow methods are based on SIFT features, which
are robust to local geometric and local affine distortions. The
SIFT Flow method enforces matching smoothness between each pixel
and its neighboring pixels. Due to the enforcement of these pixel
connections, the SIFT Flow matching results may exhibit large
distortions.


Fig. 5 Qualitative comparison. We show some example results of compared methods. The first column stacks matching image
pairs. The second column shows ground-truth labels of input images. We establish the pixel correspondences for the top image
of each image pair. Columns 3-6 show the warping results from the exemplar image (the bottom one in the pair) to the test
image (top one) via pixel correspondences by our method, DSP, SIFT flow and CSH, respectively. Here black and white colors
indicate the labels of background and target.


The DSP method takes advantage of pyramid graph matching and
focuses on pixel-level optimization efficiency. The results show
that the DSP method achieves better results than SIFT Flow, which
is consistent with the results presented in [3]. Since DSP removes
the neighboring-pixel connections in the pixel-level matching
optimization, pixels tend to be matched in a scattered manner
beyond the object boundary.

The CSH method finds patch matches mainly based on patch
appearance similarity and does not rely on the image coherence
assumption. Its results are visually more pleasing (the warping
result is highly similar to the test image). However, object
pixels in a test image can easily be matched incorrectly to
background pixels in the exemplar image, which causes the low
label transfer accuracy.

Our method combines the advantages of these three methods.
Grid-cell layer matching in our method considers the matching cost
and geometric regularization in a pyramid of cells with different
sizes, and cell matching guides the patch-level matching. The
pixel matching then refines the results of the patch-level
matching. Figure 5 shows that our method not only achieves
accurate deformable matching but also preserves the object's main
shape. The matching accuracy of our patch-level matching is better
than that of DSP, while being much faster. As shown above, our
patch-level results already outperform state-of-the-art results.
Our pixel-level matching in general provides even better matching
accuracy at a higher computational cost.


3.2.3 Scene matching and segmentation 


In this scene matching and segmentation experiment, we report the
results on the LMO dataset [11]. Most of the LMO images are
outdoor scenes including streets, beaches, mountains and
buildings. The dataset names the top 33 object categories with the
most labeled pixels. Other object categories are considered as the
34th category, ‘unlabeled’ [11]. This experiment is more complex
than the experiment on the Caltech-101 dataset, which contains
only two labels (foreground or background).


Fig. 6 Example scene matching results of compared methods. The layout follows Figure 5. The only
difference here is that the scenes have multiple labels. Different pixel colors encode the 33 classes contained in the LMO dataset
(e.g., tree, building, car, sky).


For each test image, we select the 9 most similar exemplar images
by Euclidean distances of GIST, as in [3, 11].

Through the matching process, we obtain dense correspondences
between the test image and the selected exemplar images. Similar
to the Caltech-101 matching experiment, we transfer the pixel
labels to the test image pixels from the corresponding exemplar
pixels.
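
The label transfer step can be sketched in Python as a lookup through the dense correspondence field. The function name, the `(d_row, d_col)` layout of the translation field, and the clipping of out-of-range targets to the image border are assumptions made for this sketch.

```python
import numpy as np


def transfer_labels(exemplar_labels, flow):
    """Copy exemplar labels to test pixels through a dense translation field.

    exemplar_labels: (H_e, W_e) integer label map of the exemplar image.
    flow: (H, W, 2) integer array; flow[y, x] = (d_row, d_col) maps test
          pixel (y, x) to exemplar pixel (y + d_row, x + d_col).
    Targets outside the exemplar are clipped to its border (an assumption).
    """
    h, w = flow.shape[:2]
    rows, cols = np.mgrid[0:h, 0:w]
    src_rows = np.clip(rows + flow[..., 0], 0, exemplar_labels.shape[0] - 1)
    src_cols = np.clip(cols + flow[..., 1], 0, exemplar_labels.shape[1] - 1)
    return exemplar_labels[src_rows, src_cols]
```

When several exemplars are matched to one test image, calling this function once per exemplar yields one candidate label map per exemplar.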

For scene matching, some example results are shown in Figure 6.
Again, our method is more robust to image variations (scene
appearance changes and structure changes), not only in scene
warping but also in label transfer. Our labeling results appear
more similar to the ground-truth labels of the test image.

For the scene segmentation, we follow the method described in
[11]. After the matching and warping process, each pixel in the
test image may have multiple labels obtained by matching different
exemplars. To obtain the final image segmentation, we reconcile
the multiple labels and impose spatial smoothness under a Markov
random field (MRF) model. The label likelihood is defined by the
feature distance between a test image pixel and its corresponding
exemplar image pixel. In this experiment, we randomly pick 40
images as test images. We report the patch-level results of our
method on this dataset.


method      LT-ACC   IOU
--------------------------
Ours        0.702    0.505
DSP         0.677    0.498
SIFT Flow   0.687    0.479
CSH         0.612    0.365

Table 3 Scene segmentation performance of different methods
on the LMO dataset in matching accuracy. For our method, the
dictionary is learned by using K-means and the encoding scheme
is K-means triangle. The dictionary size is set to 100. The best
results are boldfaced.


Table 3 shows the segmentation accuracy of our method compared
with state-of-the-art methods. Our method outperforms the
state-of-the-art methods in segmentation accuracy. In this
experiment, we notice that SIFT Flow outperforms the DSP method
when the exemplars are GIST neighbors, which is consistent with
the results in [3]. The segmentation accuracy of DSP relies on the
exemplar list [3]. Our method does not have this problem. In the
multi-class pixel labelling experiment, our method outperforms the
compared methods by about 0.02 in matching accuracy. Our IOU score
is also better. The experimental results show that our method
provides higher matching accuracy. The reason is two-fold: (a) the
learned features provide higher discriminability between classes,
and (b) our matching model is more suitable for the learned
features to carry out dense matching tasks.


method               LT-ACC   Time (s)
--------------------------------------
Ours (patch layer)   0.729    0.18
DSP                  0.727    1.64
SIFT Flow            0.711    9.38
CSH                  0.6173   0.42

Table 4 Intra-class image matching performance on the Pascal
dataset. The best results are in bold.



3.2.4 Matching results on the Pascal VOC dataset 


This part of the experiments is carried out on the Pascal Visual
Object Classes (VOC) 2012 dataset [12]. There are 2913 images in
the segmentation task, which have ground-truth annotations for
each image pixel. There are 20 specified object classes in the
Pascal 2012 dataset. Objects which do not belong to one of these
classes are given a ‘background’ label.

In our experiment, we randomly choose 30 image pairs for each
class (600 image pairs in total) from those images which contain
only one object class. Each pair of images comes from the same
class. We consider the objects as ‘foreground’ and everything else
as ‘background’. The parameter setting is the same as in our
experiments on the Caltech-101 dataset. We use the same dictionary
to obtain our pixel features as in the other experiments.

Table 4 shows the matching accuracy and CPU time of our method and
of the compared methods. Again, our method achieves the highest
label transfer accuracy and is more than 8 times faster than the
DSP method. The pie chart in Figure 7(a) shows the percentage of
classes in which each method achieves the best accuracy among the
20 classes. The proposed method achieves the best matching
accuracy in 10 classes, outperforming the compared methods.
Figure 7(b) shows the histogram of the best matching accuracy over
all classes.

The results on the Pascal dataset are slightly different from
those on the Caltech-101 dataset. As we can see from Figure 7(b),
most of the classes in the Pascal dataset are low-accuracy
classes, which means that objects in the same class vary more than
those in the Caltech-101 dataset. In this experiment, our method
still outperforms other methods in most cases. This confirms that
our method achieves better matching results under large object
appearance variations, and further demonstrates that our proposed
method is more suitable for handling the matching problem of large
intra-class variability than the compared methods.


Fig. 7 (a) shows the percentage of each method achieving the best
accuracy on the Pascal dataset. Our method achieves the best
matching accuracy in 10 classes out of 20. (b) shows the histogram
of each method's achievements in matching accuracy (LT-ACC) in all
classes.


Figure 8 shows some example results of the compared methods. The
results show that our method is more robust than other methods
under large object appearance variations (e.g., the first example)
and cluttered backgrounds (e.g., the third and fourth examples).


Fig. 8 Qualitative comparison. We show some example results of compared methods on the Pascal dataset. The first column
stacks matching image pairs. The second column shows the ground-truth labels of input images. Columns 3-6 show the warping results
from the exemplar image to the test image via pixel correspondences by our method, DSP, SIFT flow and CSH, respectively.


3.3 Analysis of feature learning

In this section, we examine several factors that may affect the
performance of our proposed matching algorithms. We randomly pick
20 pairs of images for each object class of the Caltech-101
dataset (thus 2020 pairs of images in total). The parameters are
set as follows unless otherwise specified. We use K-means
dictionary learning and K-means triangle encoding for our method.
The dictionary is learned from $10^6$ image patches extracted from
200 Background_Google class images in Caltech-101. The dictionary
size is set to 100. Each pixel feature is extracted from an
11 x 11-pixel region centered at that pixel. Patch features are
then calculated by max-pooling pixel features within each
non-overlapping patch of 7 x 7 pixels.
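
To make the pixel-feature extraction step concrete, the following Python sketch encodes the 11 x 11 region around every pixel with K-means triangle (KT) encoding, in which feature $k$ is $\max(0, \mu(z) - z_k)$ with $z_k$ the distance to centroid $k$ and $\mu(z)$ the mean distance. It is a sketch under stated assumptions rather than the authors' pipeline: the function names are hypothetical, the image is assumed to be grayscale, borders are handled by reflection padding, and no patch normalization or whitening is applied. The dictionary is taken as given (for instance K-means centroids).

```python
import numpy as np


def kt_encode(patches, dictionary):
    """K-means triangle encoding of flattened patches.

    patches: (n, d) array; dictionary: (k, d) array of centroids.
    Returns an (n, k) non-negative feature array.
    Note: the broadcast below needs O(n * k * d) memory.
    """
    diff = patches[:, None, :] - dictionary[None, :, :]
    z = np.sqrt((diff ** 2).sum(axis=2))            # distances to centroids
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)


def pixel_features(image, dictionary, region=11):
    """Encode the region x region neighbourhood of every pixel.

    image: (H, W) grayscale array (an assumption of this sketch).
    Returns an (H, W, k) feature map.
    """
    r = region // 2
    padded = np.pad(image, r, mode="reflect")       # assumed border handling
    h, w = image.shape
    patches = np.stack([
        padded[y:y + region, x:x + region].ravel()
        for y in range(h) for x in range(w)
    ])
    return kt_encode(patches, dictionary).reshape(h, w, -1)
```

The resulting feature length equals the number of dictionary atoms, which matches the statement that the feature dimension is the dictionary size.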


3.3.1 Evaluation of different dictionary learning 
methods and encoding schemes 


Here we examine the importance of dictionary learning and feature
encoding methods with respect to the final dense matching
accuracy. First, we compare the K-means dictionary learning
method, which has been used in our experiments in the previous
section, with two other dictionary learning algorithms, namely
K-SVD [30] and randomly sampled patches (RA) [8].

As we can see from Table 5, different dictionary learning methods
do not have a significant impact on the final matching results.
Even using randomly sampled patches as the dictionary can achieve
encouraging matching performance. Different learning methods lead
to similar matching accuracies. As concluded in [8], the main
value of the dictionary is to provide a basis, and how to
construct the dictionary is less critical than the choice of
encoding. We show that this conclusion also holds for the
application of pixel matching.


Encoder   Dictionary   LT-ACC   IOU     LOC-ERR
------------------------------------------------
KT        K-SVD        0.789    0.467   0.354
KT        RA           0.792    0.481   0.336
KT        K-means      0.803    0.505   0.324
OMP-K     K-means      0.755    0.386   0.497
SA        K-means      0.667    0.014   1.621

Table 5 Object matching performance using different dictionary
learning and encoding methods on Caltech-101 in matching accuracy.
Definition of the acronyms: KT (K-means triangle), OMP-K
(orthogonal matching pursuit), SA (soft assignment), RA (random
sampling).


Then, we compare three encoding schemes: K-means triangle (KT)
[I], OMP-k [8] and soft assignment (SA) [34] to evaluate the
impact of different encoding schemes. We apply the OMP encoding
with a sparsity level k = 10. According to Table 5, KT encoding
achieves the best matching result and clearly outperforms the
other encoding methods (LT-ACC of 0.803 versus 0.755 for OMP-k
and 0.667 for SA). This shows that the encoding scheme has a
larger impact on the feature performance than dictionary learning.

In our model, there are two properties that pixel features should
have. The first is sparsity. Since we use max-pooling to form our
patch features, some degree of sparsity contributes to the
improvement. Lack of sparsity in pixel features may decrease the
power of patch features. With the KT method, roughly one half of
the features are set to 0. SA results in dense features and it
performs very poorly in our framework. This is very different from
image classification applications [34].
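
The sparsity contrast between KT and SA features can be checked numerically with a small Python sketch on synthetic data. Everything here is illustrative: the function names, the Gaussian toy data, and the softmax form of soft assignment with bandwidth `beta` are assumptions, so the printed fractions say nothing about the paper's actual features.

```python
import numpy as np


def pairwise_distances(x, centroids):
    """Euclidean distances between rows of x (n, d) and centroids (k, d)."""
    diff = x[:, None, :] - centroids[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def kt_features(z):
    """K-means triangle: zero out centroids farther than the mean distance."""
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)


def sa_features(z, beta=0.1):
    """Soft assignment as a softmax over negative squared distances."""
    logits = -beta * z ** 2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def zero_fraction(features):
    """Fraction of feature entries that are exactly zero."""
    return float(np.mean(features == 0.0))


rng = np.random.default_rng(0)
samples = rng.standard_normal((200, 16))      # synthetic "patches"
centroids = rng.standard_normal((100, 16))    # synthetic dictionary
distances = pairwise_distances(samples, centroids)
print("KT zero fraction:", zero_fraction(kt_features(distances)))
print("SA zero fraction:", zero_fraction(sa_features(distances)))
```

KT zeroes every entry whose centroid is farther than the mean distance, whereas the softmax weights of SA are strictly positive, so max-pooling over SA features mixes contributions from all atoms.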


The other property is smoothness. The authors of [10] use the
congealing method to align face images, which reduces entropy by
performing local hill-climbing in the transformation parameters.
The smoothness of that optimization landscape is a key factor in
their successful alignment. Our method may face a similar
situation. To find the dense correspondence, we optimize the
objective function via belief propagation (BP). Without
smoothness, the algorithm can easily get stuck at a local minimum.
OMP-k features are sparser than the KT features, but they are not
sufficiently smooth to perform well in our framework.

The results in Table 5 show that KT encoding performs better than
the other two methods in our framework. This observation deviates
from the case of generic image classification, in which many
encoding methods (KT, SA, sparse coding, soft thresholding, etc.)
have performed similarly [8, 34].


3.3.2 Evaluation of the dictionary size 

In this subsection, we evaluate the impact of the dictionary size
on the performance of dense matching. As the dictionary size
equals the feature dimension, a larger dictionary implies more
patch information in each feature, which may lead to better
matching performance. The trade-off is that a larger feature
dimension requires more computation and slows down the matching
procedure.

We evaluate six dictionary sizes (64, 100, 144, 196, 289, 400) on
object matching performance, while keeping other parameters fixed.
As shown in Table 6, a dictionary of size 196 achieves the highest
accuracy, slightly above that of size 64 (0.769 versus 0.764 in
LT-ACC). Beyond this point, we observe slightly decreased
accuracies. However, the CPU time grows roughly threefold, from
0.04 seconds per image matching with a dictionary size of 64 to
0.12 seconds with a dictionary of size 196, as shown in Table 6.
It can be seen that even with a dictionary size of 64 we can
achieve high accuracy. This experiment shows that a bigger
dictionary (longer feature length) leads to slightly better
matching accuracy over a wide range. Based on these observations,
we have chosen the dictionary size to be 100 in our object
matching and scene segmentation experiments as a balance between
accuracy and CPU time.

Also note that with this choice, our feature dimension is actually
smaller than SIFT's dimension of 128. This has contributed to the
faster speed of our method compared to methods like SIFT Flow.


3.3.3 Effect of the patch size for extracting pixel 
features 


This experiment considers the effect of the patch size used for
extracting pixel features (we evaluate 12 patch sizes, from 5 x 5
to 27 x 27 pixels). For each image pixel, we extract a patch of a
given size around that pixel and obtain the pixel feature using KT
encoding. Larger patch regions allow us to extract more complex
features as they may contain more information. On the other hand,
they increase the dimensionality of the space that the algorithm
must cover. The results are shown in Figure 9(a) and Figure 9(c).
Overall, larger patches for extracting pixel features lead to
better matching accuracy. From 5 x 5 to 17 x 17 pixels, the
matching performance increases significantly, while beyond
17 x 17 pixels the accuracy improvement becomes negligible. From
Figure 9(c) we can see that the impact of the patch size for
extracting pixel features is much greater than that of the
dictionary size. As shown in Figure 9(c), the best performance for
a patch size of 5 x 5 is obtained with a dictionary size of 100.
Meanwhile, for a patch size of 11 x 11, the best performance point
shifts to a dictionary size of 196. This suggests that one should
choose a larger dictionary when larger patches are used.


3.3.4 Impact of the pooling size at the patch layer 


We examine the impact of the pooling size used for obtaining the
patch-layer features. As described in Section 2.1, an image is
divided into non-overlapping pooling regions. Each of those
pooling regions (patches) is represented by a patch feature, which
is obtained by max-pooling all pixel features within that patch.
As a result, larger pooling regions (patches) result in fewer
pooled features.
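
The non-overlapping max-pooling described above can be sketched in Python with a reshape. The function name is hypothetical, and cropping the rows and columns that do not fill a complete pooling region is an assumption of the sketch; the text does not say how image borders are handled.

```python
import numpy as np


def max_pool_patches(pixel_features, pool=7):
    """Max-pool an (H, W, D) pixel feature map over non-overlapping patches.

    Returns an (H // pool, W // pool, D) patch feature map. Border pixels
    that do not fill a whole patch are dropped (an assumption).
    """
    h, w, d = pixel_features.shape
    hp, wp = h // pool, w // pool
    cropped = pixel_features[:hp * pool, :wp * pool]
    blocks = cropped.reshape(hp, pool, wp, pool, d)
    return blocks.max(axis=(1, 3))
```

With a pooling size of 7, the number of patch features is roughly 1/49 of the number of pixel features, which illustrates why larger pooling regions give fewer nodes to match.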


The experimental result is shown in Figure 9(b). We consider four
pooling sizes (3 x 3, 7 x 7, 11 x 11 and 15 x 15 pixels). The best
performance is achieved by the pooling size of 7 x 7 pixels,
regardless of the region size for extracting pixel-layer features.


Method      Dictionary size   LT-ACC   CPU time (s)
---------------------------------------------------
Ours        64                0.764    0.041
Ours        100               0.765    0.052
Ours        144               0.767    0.065
Ours        196               0.769    0.118
Ours        289               0.767    0.523
Ours        400               0.765    1.17
DSP         -                 0.757    0.557
SIFT Flow   -                 0.741    2.83
CSH         -                 0.594    0.165

Table 6 Evaluation of different dictionary sizes w.r.t. the final matching accuracy. We can see that with a dictionary size of
100, our method has already outperformed DSP in accuracy and CPU time.





Fig. 9 The impact of changes in the model setup of the feature
learning process. (a) shows the matching accuracy using different
patch sizes for extracting pixel-layer features. (b) shows the
impact of the max-pooling size for obtaining the patch-layer
features. (c) shows the performance of different patch sizes
and dictionary sizes. The impact of the patch size is greater
than that of the dictionary size. Beyond a patch size of 17 x 17,
the gain from increasing the patch size becomes less noticeable.


Fig. 10 Matching accuracies with varying amounts of training
data and varying dictionary sizes.

3.3.5 Impact of the size of training data 

In all the previous experiments, the dictionary is learned from
$10^6$ patches extracted from 200 Background_Google class images
in the Caltech-101 dataset. In this experiment, we evaluate the
impact of the training data size. Multiple dictionaries are
learned from different numbers of patches in the
Background_Google class of the Caltech-101 dataset. The matching
accuracy increases very slightly with more sampled patches for
training, which is expected.


4 Conclusions 

We have proposed to learn features for pixel correspondence
estimation in an unsupervised manner. A new multi-layer matching
algorithm is designed, which naturally aligns with the
unsupervised feature learning pipeline. For the first time, we
show that learned features can work better than widely-used
hand-crafted features like SIFT on the problem of dense pixel
correspondence estimation.

We empirically demonstrate that our proposed algorithm can
robustly match different objects or scenes exhibiting large
appearance differences and achieve state-of-the-art performance in
terms of both matching accuracy and running time. A limitation of
the proposed framework is that the system is currently not very
robust to rotation and scale variations. We want to pursue this
issue in future work.

We have made the code available online at
https://bitbucket.org/chhshen/uf1.